{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Extracting embeddings from pre-trained BERT \n",
    "\n",
    "Let us understand how to extract embeddings from pre-trained BERT with an example. Consider the sentence - 'I love Paris'. Say, we need to extract the contextual embedding of each word in the sentence. To do this, first, we tokenize the sentence and feed the tokens to the pre-trained BERT which will return the embeddings for each of the tokens. Apart from obtaining the token-level (word-level) representation, we can also obtain the sentence level representation. \n",
    "\n",
    "In this section let us understand how exactly we can extract the word level and sentence level embedding from the pre-trained BERT model in detail. \n",
    "\n",
    "Let's suppose, we want to perform a sentiment analysis task and say we have the dataset as shown below:\n",
    "\n",
    "\n",
    "![title](images/2.png)\n",
    "\n",
    "As we can observe from the preceding table, we have sentences and their corresponding label where 1 indicates positive sentiment and 0 indicates negative sentiment. We can train a classifier to classify the sentiment of a sentence using the given dataset. \n",
    "\n",
    "But we can't feed the given dataset directly to a classifier, since it has text. So first, we need to vectorize them. We can vectorize the text using methods like TF-IDF, Word2vec, and so on. In the previous chapter, we learned how BERT learns the contextual embedding unlike other context-free embedding models like word2vec. Now, we will see how to use the pre-trained BERT for vectorizing the sentences given in our dataset. \n",
    "\n",
    "Let's take the first sentence in our dataset- 'I love Paris'. First, we tokenize the sentence using the WordPiece tokenizer and get the tokens (words). After tokenizing the sentence, we have:\n",
    "\n",
    "tokens = [I, love, Paris]\n",
    "\n",
    "Now, we add the [CLS] token at the beginning and [SEP] token at the end. Thus, our tokens list become:\n",
    "\n",
    "tokens = [ [CLS], I, love, Paris, [SEP] ]\n",
    "\n",
    "Similarly, we can tokenize all the sentences in our training set. But the length of each sentence varies right? Yes and so does the length of the tokens. We need to keep the length of all the tokens the same. Say, we keep the length of the tokens to 7 for all the sentences in our dataset. If we look at our preceding tokens list, the tokens length is 5. To make the tokens length to 7, we add a new token called [PAD]. Thus, now our tokens become:\n",
    "\n",
    "tokens = [ [CLS], I, love, Paris, [SEP], [PAD], [PAD] ]\n",
    "\n",
    "As we can observe, now our tokens length is 7 by adding two [PAD] tokens. The next step is we need to let our model understand that the [PAD] token is added only to match the tokens length and it is not part of the actual tokens. To do this, we introduce an attention mask. We set the attention mask value to 1 in all positions and 0 to the position where we have [PAD] token as shown below:\n",
    "\n",
    "attention_mask =  [ 1,1,1,1,1,0,0]\n",
    "\n",
    "Next, we map all the tokens to a unique token ID. Suppose, the following is the mapped token ID:\n",
    "\n",
    "token_ids = [101, 1045, 2293, 3000, 102, 0, 0]\n",
    "\n",
    "It implies that id 101 indicates the token [CLS], 1045 indicates the token 'I', 2293 indicates the token 'Paris', and so on. \n",
    "\n",
    "Now, we feed the token_ids along with the attention_mask as an input to the pre-trained BERT and obtain the vector representation (embedding) of each of the tokens.\n",
    "\n",
    "The below figure shows how we use pre-trained BERT for obtaining the embedding. For clarity, the tokens are shown instead of token ids. As we can notice, once we feed the tokens as the input, encoder 1 computes the representation of all the tokens and sends it to the next encoder which is encoder 2. Encoder 2 takes the representation computed by encoder 1 as input and computes its representation and sends it to the next encoder which is encoder 3. In this way, each encoder sends its representation to the next encoder above it. The final encoder which is encoder 12 returns the final representation (embedding) of all the tokens in our sentence:\n",
    "\n",
    "\n",
    "![title](images/3.png)\n",
    "\n",
    "As shown in the preceding figure, $R_{[CLS]}$ is the embedding of the token [CLS], $R_I$ is the embedding of the token I, $R_{love}$ is the embedding of the token love, and so on. Thus, in this way, we can obtain the representation of each of the tokens. These representations are basically the contextualized word(token) embedding. Say, we are using the pre-trained BERT-base, then in that case, the representation size of each token is 768.\n",
    "\n",
    "We learned how to obtain the representation for each word in the given sentence 'I love Paris'. But how to obtain the representation of the complete sentence? \n",
    "\n",
    "We learned that we have prepended the [CLS] token to the beginning of our sentence. The representation of the [CLS] token will hold the aggregate representation of the complete sentence. So, we can ignore the embedding of all other tokens and take the embedding of [CLS] token and assign them as a representation of our sentence. Thus, the representation of our sentence 'I love Paris' is just the representation of the [CLS] token . \n",
    "\n",
    "In a very similar fashion, we can compute the vector representation of all the sentences in our training set. Once, we have the sentence representation of all the sentences in our training set then we can feed those representations as input and train a classifier to perform a sentiment analysis task. \n",
    "\n",
    "Note that using the representation of the [CLS] token as the sentence representation is not always a good idea. The efficient way to obtain the representation of a sentence is either averaging or pooling the representation of all the tokens. We will learn more about this in the upcoming chapters. \n",
    "\n",
    "Now that, we have learned how to use the pre-trained BERT for extracting embedding (representation), in the next section, we will learn how to implement this using a library known as transformers. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
